7 research outputs found

    Accelerating K-mer Frequency Counting with GPU and Non-Volatile Memory

    The emergence of Next Generation Sequencing (NGS) platforms has increased the throughput of genomic sequencing and, in turn, the amount of data that needs to be processed, requiring highly efficient computation for its analysis. In this context, modern architectures including accelerators and non-volatile memory are essential to enable the mass exploitation of these bioinformatics workloads. This paper presents a redesign of the main component of SMUFIN, a state-of-the-art reference-free method for variant calling, adapted to make the most of GPUs and NVM devices. SMUFIN relies on counting the frequency of k-mers (substrings of length k) in DNA sequences, a well-known problem in many bioinformatics workloads, such as genome assembly. We propose techniques to improve the efficiency of k-mer counting and to scale up workloads like SMUFIN, which used to require 16 nodes of MareNostrum 3, to a single machine with a GPU and NVM drives. Results show that although the single machine is not able to improve the time to solution of 16 nodes, its CPU time is 7.5x shorter than the aggregate CPU time of the 16 nodes, with a 5.5x reduction in energy consumption.

    This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 639595). It is also partially supported by the Ministry of Economy of Spain under contract TIN2015-65316-P and the Generalitat de Catalunya under contract 2014SGR1051, by the ICREA Academia program, and by the BSC-CNS Severo Ochoa program (SEV-2015-0493). We are also grateful to SanDisk for lending the FusionIO cards and to Nvidia, which donated the Tesla K40c. Peer Reviewed. Postprint (author's final draft).
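The k-mer counting problem at the core of SMUFIN can be stated in a few lines. The sketch below is a minimal, CPU-only illustration of the problem definition, not the GPU/NVM implementation described in the paper; function names and the example sequence are illustrative.

```python
from collections import Counter

def count_kmers(sequence: str, k: int) -> Counter:
    """Count the frequency of every k-mer (substring of length k)
    by sliding a window of size k across the sequence."""
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        counts[sequence[i:i + k]] += 1
    return counts

# A sequence of length n contains n - k + 1 overlapping k-mers.
counts = count_kmers("ACGTACGT", 3)
```

Production counters avoid this hash-map-of-strings approach: they pack each k-mer into a 2-bit-per-base integer and partition the key space so that counting parallelizes across devices, which is what makes GPU offload and NVM spill practical.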

    Topology-aware GPU scheduling for learning workloads in cloud environments

    Recent advances in hardware, such as systems with multiple GPUs and their availability in the cloud, are enabling deep learning in various domains including health care, autonomous vehicles, and the Internet of Things. Multi-GPU systems exhibit complex connectivity among GPUs and between GPUs and CPUs. Workload schedulers must consider hardware topology and workload communication requirements in order to allocate CPU and GPU resources for optimal execution time and improved utilization in shared cloud environments. This paper presents a new topology-aware workload placement strategy to schedule deep learning jobs on multi-GPU systems. The placement strategy is evaluated with a prototype on a Power8 machine with Tesla P100 cards, showing speedups of up to ≈1.30x compared to state-of-the-art strategies; the proposed algorithm achieves this result by allocating GPUs that satisfy workload requirements while preventing interference. Additionally, a large-scale simulation shows that the proposed strategy provides higher resource utilization and performance in cloud systems.

    This project is supported by the IBM/BSC Technology Center for Supercomputing collaboration agreement. It has also received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 639595). It is also partially supported by the Ministry of Economy of Spain under contract TIN2015-65316-P and the Generalitat de Catalunya under contract 2014SGR1051, by the ICREA Academia program, and by the BSC-CNS Severo Ochoa program (SEV-2015-0493). We thank our IBM Research colleagues Alaa Youssef and Asser Tantawi for the valuable discussions. We also thank SC17 committee member Blair Bethwaite of Monash University for his constructive feedback on earlier drafts of this paper. Peer Reviewed. Postprint (published version).
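The core idea of topology-aware placement is to prefer GPU sets with strong mutual links (e.g. NVLink pairs) over sets that communicate across slower PCIe or inter-socket paths. The sketch below is a hypothetical greedy illustration, not the paper's algorithm: the bandwidth matrix, values, and scoring rule are all assumptions made for the example.

```python
import itertools

# Hypothetical pairwise link bandwidth (GB/s) for a 4-GPU node:
# GPUs {0,1} and {2,3} are NVLink-connected pairs; other paths cross PCIe.
BANDWIDTH = [
    [0, 80, 16, 16],
    [80, 0, 16, 16],
    [16, 16, 0, 80],
    [16, 16, 80, 0],
]

def place(num_gpus, free_gpus):
    """Pick the set of free GPUs whose weakest internal link is fastest,
    so a multi-GPU job never straddles a slow path it could have avoided."""
    best, best_score = None, -1.0
    for combo in itertools.combinations(free_gpus, num_gpus):
        score = min((BANDWIDTH[a][b] for a, b in itertools.combinations(combo, 2)),
                    default=float("inf"))  # a single GPU has no internal link
        if score > best_score:
            best, best_score = combo, score
    return best
```

With all four GPUs free, a 2-GPU job lands on an NVLink pair; a real scheduler would additionally weigh CPU affinity and interference from co-located jobs, as the paper discusses.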

    Considerations in using OpenCL on GPUs and FPGAs for throughput-oriented genomics workloads

    The recent upsurge in the available amount of health data and the advances in next-generation sequencing are setting the ground for the long-awaited precision medicine. To process this deluge of data, bioinformatics workloads are becoming more complex and more computationally demanding. For this reason, they have been extended to support different computing architectures, such as GPUs and FPGAs, to leverage the form of parallelism typical of each architecture. This paper describes how a genomic workload such as k-mer frequency counting that takes advantage of a GPU can be offloaded to one or even more FPGAs. Moreover, it performs a comprehensive analysis of the FPGA acceleration, comparing its performance to a non-accelerated configuration and to one using a GPU. Lastly, the paper focuses on how, when using accelerators with a throughput-oriented workload, one should take into consideration both kernel execution time and how well each accelerator board overlaps kernels and PCIe transfers. Results show that acceleration with two FPGAs can improve both time- and energy-to-solution for the entire accelerated part by a factor of 1.32x. Conversely, acceleration with one GPU delivers an improvement of 1.77x in time-to-solution but a lower 1.49x in energy-to-solution due to its persistently higher power consumption. The paper also evaluates how future FPGA boards with components (i.e., off-chip memory and PCIe) on par with those of the GPU board could provide an energy-efficient alternative to GPUs. Peer Reviewed. Postprint (published version).
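The interplay between kernel time and transfer overlap that the paper highlights can be captured with a simple throughput model. This is a back-of-the-envelope sketch under the usual pipelining assumption (double buffering hides transfers behind compute); the numbers are illustrative, not measurements from the paper.

```python
def time_per_batch(transfer_s, kernel_s, overlap):
    """Effective time per batch in a throughput-oriented pipeline.
    With full overlap, PCIe transfers hide behind kernel execution,
    so the slower of the two dominates; without overlap they add up."""
    return max(transfer_s, kernel_s) if overlap else transfer_s + kernel_s

# A board with a slower kernel but good transfer/compute overlap can still
# beat a board with a faster kernel that serializes its PCIe transfers:
slow_kernel_good_overlap = time_per_batch(2.0, 3.0, overlap=True)
fast_kernel_no_overlap = time_per_batch(2.0, 2.5, overlap=False)
```

This is why the paper argues that per-kernel benchmarks alone are misleading for throughput workloads: board-level behavior (how well transfers and kernels overlap) can invert the ranking.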

    A highly parameterizable framework for Conditional Restricted Boltzmann Machine based workloads accelerated with FPGAs and OpenCL

    © 2020 Elsevier. This manuscript version is made available under the CC-BY-NC-ND 4.0 license: http://creativecommons.org/licenses/by-nc-nd/4.0/

    The Conditional Restricted Boltzmann Machine (CRBM) is a promising candidate for multidimensional system modeling that can learn a probability distribution over a set of data. It is a specific type of artificial neural network with one input (visible) and one output (hidden) layer. Recently published works demonstrate that the CRBM is a suitable mechanism for modeling multidimensional time series such as human motion, workload characterization, and city traffic analysis. The process of learning and inference in these systems relies on linear algebra functions like matrix–matrix multiplication, and for larger data sets it is very compute-intensive. In this paper, we present a configurable framework for CRBM-based workloads with arbitrarily large models. We show how to accelerate the learning process of the CRBM with FPGAs and OpenCL, and we conduct an extensive scalability study for different model sizes and system configurations. We show a significant improvement in performance/Watt for large models and batch sizes (from 1.51x up to 5.71x depending on the host configuration) when we use an FPGA and OpenCL for the acceleration, and limited benefits for small models compared to the state-of-the-art CPU solution.

    This work was supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 639595); the Ministry of Economy of Spain under contract TIN2015-65316-P and the Generalitat de Catalunya, Spain, under contract 2014SGR1051; the ICREA Academia program, Spain; the BSC-CNS Severo Ochoa program, Spain (SEV-2015-0493); and Intel Corporation, United States. Peer Reviewed. Postprint (published version).
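The matrix–matrix products that dominate CRBM learning can be seen in the hidden-layer update: the hidden activation conditions on both the current visible vector and a window of past observations. The sketch below is a generic NumPy formulation of that step under standard CRBM definitions; the variable names and shapes are illustrative, and the paper's framework implements these products as OpenCL kernels on FPGAs rather than in NumPy.

```python
import numpy as np

def crbm_hidden_probs(visible, history, W, U, b):
    """Hidden-unit activation probabilities of a CRBM.
    visible: (n, V) current observations; history: (n, D) conditioning window;
    W: (V, H) visible-to-hidden weights; U: (D, H) history-to-hidden weights;
    b: (H,) hidden bias. The two dense products dominate the compute cost."""
    pre_activation = visible @ W + history @ U + b
    return 1.0 / (1.0 + np.exp(-pre_activation))  # logistic sigmoid
```

Because the cost grows with both model size and batch size, these products are exactly where an accelerator pays off for large models, matching the paper's observation that small models see limited benefit.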

    Enabling distributed key-value stores with low latency-impact snapshot support

    Current distributed key-value stores generally provide greater scalability at the expense of weaker consistency and isolation. However, additional isolation support is becoming increasingly important in the environments in which these stores are deployed, where applications with different needs are executed, from transactional workloads to data analytics. While fully-fledged ACID support may not be feasible, it is still possible to take advantage of the design of these data stores, which often includes the notion of multiversion concurrency control, to provide additional features at a much lower performance cost while maintaining their scalability and availability. In this paper we explore the effects that additional consistency guarantees and isolation capabilities may have on a state-of-the-art key-value store: Apache Cassandra. We propose and implement a new multiversioned isolation level that provides stronger guarantees without compromising Cassandra's scalability and availability. As shown in our experiments, our version of Cassandra allows Snapshot Isolation-like transactions while preserving the overall performance and scalability of the system.

    This work is partially supported by the Ministry of Science and Technology of Spain and the European Union’s FEDER funds (TIN2007-60625, TIN2012-34557), by the Generalitat de Catalunya (2009-SGR-980), by the BSC-CNS Severo Ochoa program (SEV-2011-00067), by the HiPEAC European Network of Excellence (IST-004408, FP7-ICT-217068, FP7-ICT-287759), and by IBM through the 2008 and 2010 IBM Faculty Award program. Peer Reviewed. Postprint (author's final draft).
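The reason multiversion concurrency control makes snapshot support cheap is that each write already creates a timestamped version, so a reader only needs to ignore versions newer than its snapshot. The toy store below illustrates that read rule; it is a minimal sketch of the general MVCC idea, not Cassandra's storage engine or the paper's implementation.

```python
class MVStore:
    """Toy multiversioned key-value store: a read at snapshot timestamp T
    sees only the latest version committed at or before T."""

    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), sorted by commit_ts

    def put(self, key, value, commit_ts):
        history = self.versions.setdefault(key, [])
        history.append((commit_ts, value))
        history.sort()  # keep versions ordered by commit timestamp

    def get(self, key, snapshot_ts):
        latest = None
        for ts, value in self.versions.get(key, ()):
            if ts <= snapshot_ts:
                latest = value  # newest version visible to this snapshot so far
            else:
                break  # later versions are invisible to the snapshot
        return latest
```

Because old versions are retained rather than overwritten, snapshot reads never block writers and writers never block readers, which is why the paper can add Snapshot Isolation-like transactions without sacrificing Cassandra's scalability.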

    Performance in service of post-tensioned concrete bridges

    SIGLE. Available from British Library Document Supply Centre - DSC:99/12704 / BLDSC - British Library Document Supply Centre. GB: United Kingdom.